Modeling latent spatio-temporal disease incidence using penalized composite link models

Epidemiological data are frequently recorded at coarse spatio-temporal resolutions to protect confidential information or to summarize it in a compact manner. However, the detailed patterns followed by the source data, which may be of interest to researchers and public health officials, are overlooked. We propose to use the penalized composite link model (Eilers PCH (2007)), combined with spatio-temporal P-splines methodology (Lee D.-J., Durban M (2011)) to estimate the underlying trend within data that have been aggregated not only in space, but also in time. Model estimation is carried out within a generalized linear mixed model framework, and sophisticated algorithms are used to speed up computations that otherwise would be unfeasible. The model is then used to analyze data obtained during the largest outbreak of Q-fever in the Netherlands.


Introduction
In recent decades, the development of spatio-temporal statistical methods in Epidemiology, the cornerstone of Public Health, has been remarkable. Such development has been mainly driven by advances in geographic information systems (GISs), access to reliable health data registers, and the availability of powerful software capabilities to process and analyze large amounts of data. The methodological contributions to the analysis of spatio-temporal health data come from several interdisciplinary researchers, whose backgrounds are mostly related to Statistics, Geography, Environmental Sciences, and Epidemiology.
Within the diversity of of epidemiological research, disease mapping has attracted much interest in Public Health as it helps to visualize disease incidence of mortality risk patterns in a specific area. To this end, appropriate statistical methods have been applied to health data, which are usually recorded per spatial units, to provide smoothed disease incidences per unit.
Smoothing is performed to obtain more stable and less noisy estimates of the incidence rates associated with each unit [1], which helps to determine meaningful spatial patterns. By adding the temporal dimension to this context, it is possible to examine the evolution of disease (see for example [26]). Geographical microsimulation models simulate populations in given geographical areas, with population characteristics as close as possible to their real counterparts. But a drawback may be that a representative sample of individuals at the fine-scale data required by those methods is not always available. The methodology presented here allows to create detailed dynamic maps for disease incidence data. Publicly available maps in aggregated form over space and time are the only information required, as well as geographical locations and time points where a finer resolution prediction is required.

Q fever data
Q fever is a widespread zoonotic disease caused by the bacterium Coxiella burnetii. C. burnetii transmission to humans is mainly associated with ruminants such as cattle, sheep, and goats. During parturition or abortion of infected animals, high numbers of C. burnetii are shed within the amniotic fluids and the placenta. These organisms end up in the environment, where they may survive for long periods due to their resistance to heat, dryness, and many common disinfectants. Humans are often highly susceptible to the disease, and very few organisms may be necessary to cause infection. More information about this infectious disease is provided in [27]. The Southern Netherlands faced large outbreaks of human Q fever from 2007 to 2010 [28]. In this country, the local municipal health services (MHSs) are responsible for registering all confirmed diagnoses of acute Q fever. The information collected is then entered into the electronic national infectious diseases surveillance database. Due to confidentiality, these data are not publicly available and, in some instances, maybe provided in an aggregated form. In this case, the data were made available monthly at the municipality level. The epidemic peaks of each year were observed every spring, specifically during May. This coincides with the birth period of small ruminants (sheep and goats), a fact that was pointed out in several studies about those exceptionally large Q fever outbreaks in the Netherlands (see, for example, [28,29]). Since the largest outbreak was observed during 2009, we have studied the distribution of Q fever incidence in that year.
Fig 2a shows postal code areas affected by human Q fever (red points) in 2009. However, the number of cases was only publicly available at the municipality level. Since dairy goat and sheep farms are not evenly distributed across the Netherlands, Q-fever infection did not occur in all areas of the country. In this study we have focused on a 60 × 60km area in the south of the Netherlands (see black square in Fig 2a) with a total of 72 municipalities. The total number of Q fever cases reported in these municipalities was 1798. Taking into account the number of inhabitants in each municipality, we can calculate Q fever incidence (per 100000 inhabitants). Fig 2b shows the spatial distribution of the resulting Q fever incidence (aggregated over months in 2009), with higher incidence values observed around the municipalities of Landerd (1439.676), Lith (562.546), and Heusden (295.006). Both Figs 1 and 2 show a smooth distribution of counts in space and time which makes our initial assumption of a smooth latent distribution plausible.

The spatio-temporal penalized composite link model
The Spatio-Temporal Penalized Composite Link Model (ST-PCLM) is the name of the model we propose for the spatio-temporal disaggregation of epidemiological data. ST-PCLM combines the penalized composite link model (PCLM) of [12] and spatio-temporal smoothing with P-splines [13]. The PCLM generalizes the model proposed by [30] and allows the estimation of a smooth trend, from grouped data, at a finer resolution. The idea behind it is as follows: We observe data y from a Poisson distribution with mean μ, then we assume that μ is a composition of the ungrouped distribution γ. The approach assumes that the underlying (ungrouped) distribution is smooth, but otherwise let the data determine their shape.
As PCLM applies to the unidimentional scenario, we develop it in the spatio-temporal context. Then, we have included the ideas presented in [13] to reformulate the new spatio-temporal PCLM under a generalized linear mixed model (GLMM) framework, i.e., ST-PCLM. Hereafter, we will show the development of the ST-PCLM and a computationally efficient way to estimate the ST-PCLM parameters.
The model. Let y it , i = 1, . . ., n, t = 1, . . ., T a , denote count data recorded over n non-overlapping spatial units v i , which constitute the area of interest, at T a time periods. Suppose that

PLOS ONE
we want to estimate the latent distribution of the vector of counts at a spatio-temporal support that is a refinement of the original one. The fine support is determined by three coordinates: x 1 = (x 11 , . . ., x 1m ) 0 and x 2 = (x 21 , . . ., x 2m ) 0 , with m > n, which represent the longitude and latitude coordinates of the spatial refinement, respectively; and x 3 ¼ ðx 31 ; :: which represents the coordinates of the temporal refinement. We assume that the spatial refinement remains fixed at each instant of time in x 3 , but the method can easily be extended to relax this assumption. Assuming that y is Poisson distributed with mean vector μ, the ST-PCLM is given by: where γ f denotes the fine-scale latent mean, C st is the composition matrix that describes how these latent means are combined to yield μ, and f st (x 1 , x 2 , x 3 ) represents the fine-scale spatiotemporal trend. Several approaches can be used to model that function, e.g.: Gaussian random fields, spatio-temporal kriging or penalized splines, among others. We propose the last option since smoothness might be a reasonable property to ask for: if we are estimating an unknown distribution, it seems natural to assume smoothness since we do not know much about the underlying process (the use of the other techniques mentioned is out of the scope of this paper and will be subject of further work). We assume that the non-separable function f st (x 1 , x 2 , x 3 ) may be represented by the product of simpler functions f x 1 ðx 1 Þ, f x 2 ðx 2 Þ, and f x 3 ðx 3 Þ. Therefore, the basis representation of the function would be given by: where B 1 = B(x 1 ), B 2 = B(x 2 ), and B 3 = B(x 3 ) are univariate B-spline bases [16] of dimensions m × c 1 , m × c 2 and T f × c 3 respectively. The matrix operators � and □ on the right-hand side of Eq (2) represent Kronecker and Box products (row tensor products), respectively [31] (if the spatial refinement changes over time, Box product would be used instead of Kronecker product). To achieve smoothness, we use an anisotropic penalty (based on second order differences constructed separately for each independent variable) over the vector of regression coefficients θ. The form of this matrix is as follows: where I c d denotes an identity matrix of dimension c d × c d , λ d is the smoothing parameter that controls the amount of smoothing along the coordinate x d , and P d ¼ D 0 d D d is the marginal penalty matrix based on D d , which computes q d -th differences, i.e., D q d θ ¼ D d θ, for d = 1, 2, 3. As we mentioned earlier, the matrix in Eq (3) has an anisotropic property in the sense that allows a different amount of smoothing for each coordinate.
Since we are assuming that the spatial refinement is fixed over the temporal refinement, the composition matrix C st in model (1) is obtained as: where C s and C t are the spatial and temporal composition matrices of dimensions n × m and T a × T f , respectively. The structure of these matrices depends on the type and level of aggregation. For example, if we want to estimate the latent distribution at a fine spatial grid (over a study area), the entries of the spatial composition matrix will be: where (x 1j , x 2j ) are the cell centroid coordinates of the fine grid, for i = 1, . . ., n and j = 1, . . ., m. Another option to construct C s is to consider its entries as the area proportions that each grid cell shares with a specific unit. The temporal composition matrix C t is used to disaggregate coarse time intervals into detailed time periods (for example, from years or trimesters to months, weeks, or days). In our application, we show the structure that C t will have, specifically for our purposes. Note that if C s = I n , C t ¼ I T a , and the unit centroids are used as spatial coordinates, the presented methodology is reduced to the Poisson version of the proposal given by [13] for smoothing of spatio-temporal count data. When spatial data are recorded over a coarse rectangular grid, an appropriate definition for , where the spatial refinement correspond to the cell centroid coordinates of a fine grid. The penalty matrix P st in Eq (3) is still valid in this context, because its definition is independent of data structure. Moreover, the spatial composition matrix would be given by C s = C 2 � C 1 , where each C d is constructed according to the disaggregation of the coarse grid cells into small ones, for d = 1, 2. Although the Kronecker structure in (4) is a computational advantage, our model can cope with more complex situations where the spatial disaggregation changes from one time point to another (for example, census tracks changes along the years) by simply changing the structure of C st .
Mixed model representation. Now, we show how to incorporate fine-scale population information into the model (1). To do so, we must reformulate the model as a Generalized Linear Mixed Model (GLMM) by following the proposal of [13] as it is briefly described below.
In the P-splines literature, [13] show a nice representation of the spatio-temporal trend as a mixed model. Following their approach, the term f st (x 1 , x 2 , x 3 ) in model (2) can be rewritten as Thus, we have: where X and Z are fixed and random effects matrices, and β and α are their associated coefficients, respectively. Random effects have covariance matrix G that depends on the smoothing parameters. The model includes an offset term log(e f ) that allows the analysis of mortality or incidence rates instead of counts.
The vector e f could be, for example, fine-scale population information or expected number of deaths. If we only have the offset term at the coarse scale, i.e., logðeÞ ¼ ðlogðe 11 Þ; :::; logðe n1 Þ; :::; logðe 1T 1 Þ; ::::; logðe nT a ÞÞ 0 , a naive approach to estimate e f assumes that the elements of e are evenly distributed throughout the fine resolution. Therefore, we can compute these naive estimates as The construction of the matrices X, Z, and G in model (6) are obtained by using the singular value decomposition (SVD) of each discrete penalty matrix P d in Eq (3), for d = 1, 2, 3. The mixed model matrices are given by: where X d = B d U dn and Z d = B d U ds are constructed from the matrices of singular vectors corresponding to null and non-null singular values of the SVD of P d , U dn and U ds , respectively, for d = 1, 2, 3. DenotingΣ d as the diagonal matrix with the non-null singular values of the SVD of P d in the main diagonal, the inverse of the covariance matrix G becomes the blockdiagonal matrix: where: If data are spatially recorded over a rectangular coarse grid, the corresponding mixed model matrices are obtained as in Eq (7), but replacing the Box products □ by Kronecker products �. The formulation of G −1 remains the same.
Parameter estimation. Once the ST-PCLM defined in (6) is in the GLMM framework, it is possible to estimate its parameters. This estimation procedure was presented by [20] in a spatial disaggregation context, and it involves two interrelated stages: (a) estimation of fixed coefficients and random effects (β and α); and (b) estimation of smoothing parameters (λ 1 , λ 2 , and λ 3 ). The penalized quasi-likelihood (PQL) methods of [32] are used for stage (a), and the restricted (or residual) maximum likelihood (REML, [32,33]) is used for stage (b) as a numerical optimization criterion for smoothing parameter selection. Technical details are provided in [20] and, thus, we only describe here the necessary results.
For given values of λ 1 , λ 2 , and λ 3 , the estimation of the fixed and random effects coefficients of the model (6) are obtained by maximizing the following approximate penalized log-likelihood: where 1 denotes a vector of ones of length n � T a and μ = C st γ f , with γ f = e f � exp(X β + Z α). Differentiation of Eq (9) with respect to β and α leads to the score equations: working' mixed model matrices, since W = diag(μ) and Γ = diag(γ f ) change during the estimation procedure. Defining the working vector as: the solution of the score equations in (10) via Fisher scoring algorithm is expressed as the iterative solution of the system: where b = G −1 α. This yields to a modified version of the standard mixed model estimators: where: Conditioning on the estimates (12) and (13), the smoothing parameters λ 1 , λ 2 , and λ 3 can be estimated by maximizing the approximate REML (see [32,Eq. 13]): where λ 1 , λ 2 , and λ 3 are involved in V through G. Therefore, optimal estimates for the ST-PCLM parameters are obtained by iteration between Eqs (12), (13) and (16), until convergence.
Once the ST-PCLM parameter estimates at convergence are obtained, we can derive stan- Zb α by using the Bayesian approximation of the variance-covariance matrix for ð b β; b αÞ 0 (see [34] for further details). Thus, the approximate standard errors for b η are obtained by taking the square root of Varðb ηÞ, which is obtained as: We should note that the estimation procedure described above can be computationally inefficient if we are working with large datasets and the goal is to obtain estimates at a very fine scale. Furthermore, the direct creation of the matrices C sf , X, and Z in model (7) can easily lead to storage problems. In the next Sections, we will provide solutions for the ST-PCLM estimation in terms of storage and efficiency (these solutions were not available in [20]).
Fast algorithm for spatio-temporal penalized composite link models. In order to efficiently compute estimates of the smoothing parameters, we have adapted the SAP (Separation of Anisotropic Penalties) algorithm of [35] to the ST-PCLM context. To avoid possible storage problems, we have used the so-called GLAM (Generalized Linear Array Methods, [31,36]) in the ST-PCLM setting. These methods also provide an efficient way to compute the matrix of crossproducts required for the SAP algorithm, easing the computation time of model estimation.
Under the ST-PCLM approach and conditioning on the estimates given in Eqs (12) and (13), we can estimate λ 1 , λ 2 , and λ 3 by numerical maximization of the approximate REML in Eq (16). The usual algorithms used to approximate the solutions, such as the one proposed by [37] (extended to the generalized case by [38]), can only deal with situations in which the variance-covariance matrix is linear on the variance parameters. In the case of spatio-temporal models with anisotropic penalties, regression parameters are affected by more than one smoothing parameter, and so standard approaches can't be used, since the corresponding variance-covariance matrix does not have the required form. [35] proposed the SAP algorithm which provides a numerical solution to REML estimates, but it is able to deal with models that have a precision matrix for the random effect vector that is linear in the inverse of the variance parameters. These precision matrices are common when penalized smooth models with anisotropic penalties are reformulated as (generalized) linear mixed models, where the smoothing parameters are seen as ratios of variance components, i.e., l d ¼ � working under a Poisson framework, the dispersion parameter, ϕ, is equal to 1. Thus, the problem is reduced to obtain estimates for the variance components t 2 1 , t 2 2 , and t 2 3 . The reformulation of the SAP algorithm for the ST-PCLM is described below.
Following [35] we can derive closed-form expressions, from the approximate REML, for the variance components t 2 d , for d = 1, 2, 3. These estimates are given by: where: with N defined in Eq (15), and The non-null submatrices of each Λ d were previously defined in Section. Here the inverse of the covariance matrix G in model (6) is written in terms of t 2 d 's and can be decomposed as where the capital lambdas are defined above. The SAP algorithm for the ST-PCLM parameter estimation is given in Algorithm 1, which is an adaptation of the algorithm provided in [35, p. 945].
1: Set initial values for the mixed model coefficients β and α, and the variance components t 2 1 , t 2 2 , and t 2 3 (for example, b β ð0Þ ¼ 0 with length q 1 q 2 q 3 , b α ð0Þ ¼ 0 with length (c 1 c 2 c 3 −q 1 q 2 q 3 ), and b t 2ð0Þ Given the current estimates for the mixed model coefficients, construct the matrix of weights W and the working vector z as follows: 4: for 1 to maxit 2 do 5: Given the current estimates for the variance components, obtain new estimates for β and α by solving the system in Eq (11). The resulting estimates are denoted as b β ðkþ1Þ and b α ðkþ1Þ , respectively. 6: Obtain new estimates for the variance components using Eq (17).
The resulting estimates are denoted as b t 2ðkþ1Þ d , for d = 1, 2, 3.

7:
Compare new variance component estimates with the previous ones, using the following convergence criterion: � n 1 :

8:
If the convergence tolerance is achieved, break, otherwise set b t 2ðkÞ d ¼ b t 2ðkþ1Þ d and repeat steps 5, 6, and 7 until convergence. 9: end for 10: Compute a new estimate for the fine-scale smooth trend vector using the fixed and random effects estimates obtained in the last iteration of step 5. The resulting vector is denoted as b η ðkþ1Þ . Compare the new estimate with the previous one, using the following convergence criterion: kb η ðkþ1Þ À b η ðkÞ k 2 kb η ðkþ1Þ k 2 � n 2 :

11:
If the convergence tolerance is achieved, break, otherwise set k = k + 1 and repeat steps 2 to 11 until convergence. 12: end for Although we have suggested setting the maximum number of iterations to 100, our experience is that the number of iterations needed to achieve convergence is much smaller. In general, we have observed that the maximum number of iterations to obtain optimal variance components is greater than the number of iterations to obtain optimal estimates for the linear predictor. In the analysis of Q-fever data, convergence was reached after 30 iterations.
We should note that we can efficiently compute the trace in Eq (18) by taking into account that G Λ d G is a diagonal matrix. Thus, we only have to compute the diagonal ofZ 0 NZ to obtain this trace. From [39,Eq. (5.3)], we have that: An advantage of using the adapted SAP algorithm provided above is that we can directly compute the effective dimension (ED) of model (6). This model complexity measure is given by: where each ed d is computed from Eq (18). The first term on the right-hand side of Eq (19) corresponds to the dimension of the unpenalized (or fixed) part, whereas the second corresponds to the unpenalized (or random) part of the fitted model. To visualize the later statement, note that: whereZGZ 0 N is the 'hat matrix' [40] of the unpenalized part of the fitted ST-PCLM (see Eq (13)). Note that the matrix cross-productsX 0 WX,X 0 WZ, andZ 0 WZ (and its transpose) are involved in the computation of the variance component estimates. They also appear in the estimation of fixed and random effects coefficients, which are obtained by solving the system in Eq (11). These and other required matrix cross-products can be efficiently computed by adapting the so-called GLAM methods to the ST-PCLM setting, which we will show in the next Section.
GLAM methods for spatio-temporal penalized composite link models. When we deal with estimating latent trends in multiple dimensions, we are susceptible to problems with storage and computational burden. In the case of data arranged in multidimensional grids, these problems can be overcome using the GLAM methods developed in [31,36]. These methods are designed to avoid direct computation of matrix cross-products where Kronecker operations are involved, by using sequences of nested matrix operations. In this Section, we show the use of these methods in the ST-PCLM context.
Note first that they can be reduced as: Therefore, we only need to compute C st Γ x and C st Γ Z. These expressions are obtained as follows:

Results
Once the ST-PCLM has been stated, we analyzed the data related to Q fever outbreaks in the Netherlands.

Weekly high-resolution smooth incidence maps
As mentioned above, we have focused on the distribution of Q fever incidence in the south of the Netherlands during 2009. Some of this data were analyzed in [41], the authors focused on the construction of a smooth incidence map at a fixed time, without considering the need to obtain predictions that were consistent with the aggregated data. More recently [42] focused on the spatial transmission of the disease. Let's remember that the data are aggregated at municipality and monthly levels, and that our goal is to obtain estimates of Q fever incidence at that fine grid and at each week of 2009 (i.e., disaggregate simultaneously in space and time) to obtain better insights of the evolution of the disease. The ST-PCLM approach, allows to visualize Q fever incidence at a finer spatiotemporal resolution, and also to incorporate fine-scale population information to the estimation of the latent process. Here we use two sources of information: 1) The Q fever counts that are available in the municipalities of the study area described in Fig 2b and months of 2009 (i.e., the initial spatial and temporal scales are municipalities and months, respectively), and 2) population measured at a fine spatial grid over the study area. Population information at fine spatial grids is available from the WorldPop project. The data containing the spatial distribution of population in 2009 in the Netherlands can be downloaded from https://www. worldpop.org/geodata/summary?id=42722.  (6) is considered as a vector obtained by repeating the fine-scale population fifty three times. The elements of the spatial composition matrix are obtained using Eq (5), whereas the temporal composition matrix for this case has the following form:  1). Note also that most of the highest incidences in week 19 are spatially concentrated around the area that includes points A and C in Fig 4, which are located in the municipalities of Landerd and Heusden, respectively (see Fig 2b). Fig 5 shows the approximate standard error maps associated with Fig 4. As expected in a Poisson setting, larger variances are found in areas with higher incidence rates.
From the previous ST-PCLM estimates, we can also visualize the disaggregated (weekly) temporal evolution of the Q fever disease at specifics spatial coordinates of the fine grid. Fig 6 shows the down-scaled smoothed temporal incidence (per week) at three specific locations A, B, and C, in the study area. We observe that the temporal evolution of the incidence at point B

PLOS ONE
is constant and almost zero, whereas, at points A and C, the temporally smoothed incidence present a unimodal behaviour, where the peak is reached around week 19 (May). This is consistent with the summaries of the incidences given in Figs 1 and 2.

Simulation studies
In this Section, we present one of two simulation studies that have been carried out to examine the predictive performance of the model under different scenarios. In both of them, we have done spatial and temporal disaggregation, but the focus of each simulation exercise is different. The first aims to check the performance of the ST-PCLM approach under different degrees of spatial dependence. In the second, we will study the behaviour of the model under different levels of temporal disaggregation (results from this second study are available in the S1 File).
Simulation study 1. Data are generated using the fine grid in Fig 3a as the spatial region of study, and the 53 weeks in a year as the disaggregated temporal scale. The simulation is conducted as follows: 1. The fine-resolution incidence vector is constructed based on the smoothed Q fever incidences obtained in the previous section. We denote these smoothed incidences as inc(u k ), where u k , k = 1, . . ., K, with K = 4371 × 53 = 258163, represents the spatio-temporal coordinates at fine resolution. The different levels of spatial dependence are achieved by changing the values of the variance components that control the spatial term in the model, t 2 1 and t 2 2 (the variance component for the temporal term, t 2 3 , will remain fixed at the optimal value obtained in the fit). The values for the optimal variance components in the fit were: t 2 1 ¼ 218:7, t 2 2 ¼ 127:7, and t 2 3 ¼ 92:4. Based on these results we set 3 different scenarios: • Scenario 1: Variances for the spatial component are those in the fit.
• Scenario 2: Variances for the spatial component are 100 times larger than those in the fit.
• Scenario 3: Variance for the spatial component are 1000 times smaller than those in the fit.
2. Calculation of the aggregated expected number of cases (over municipalities and months), μ kl , k = 1, . . ., 72, l = 1, . . ., 12: 3. 100 realizations of the number of cases in each municipality and month are generated through a random drawing from a Poisson distribution with mean parameter μ. For all realizations l = 1, . . ., 100, the predicted incidences inc ðlÞ P g ðu k Þ obtained from the ST-PCLM approach for each scenario g, with g = 1, 2, 3, were compared to the smoothed incidences inc g (u k ). To evaluate the performance of the model we computed, for each scenario, the mean absolute error (MAE), the root mean squared error (RMSE): jinc ðlÞ P g ðu k Þ À inc g ðu k Þj; the correlation between the observed and predicted incidence, and the percent of grid cells with true incidence falling within 95% prediction intervals of the model. These metrics were averaged over 100 simulated data sets and they are summarized in Table 1.
The model performance criteria reported in Table 1 show good model performance in terms of prediction accuracy in all scenarios. All criteria showed that slightly worse predictions where obtained in scenario 2, this is mainly due to the fact that, for consistency, we have used the same size of B-spline basis for all scenarios (those used in the data analysis), but in the case of rapidly changing spatial patters (as is the case in scenario 2), a larger basis would be necessary to correctly capture the spatial effect, or adaptive P-splines [43] could be used; however, this approach is beyond of the scope of this paper.
In Fig 8 we provide further insight on the 95% coverage achieved in scenario 1 (similar plots for the other scenarios can be found in the S1 File). The lowest coverage over time (averaged over all pixels) was 85%, and was found at the extremes of the time interval (corresponding to the periods of lower incidence). The coverage for each pixel at the fine-grid resolution (averaged over time), shows that areas with lower coverage (60%-80%) correspond to locations with the highest number of cases. However, the coverage for these pixels is not constant over time, reaching almost 100% in weeks with low incidence rates. In scenario 2, the results are similar, although the coverage is lower (especially for a small number of pixels where the averaged coverage is between 20% and 30%), this is mainly due to the poor fitting for points with a sudden high incidence pick. The coverage in scenario 3 (the smoothest spatial trend) is nearly constant over the grid, ranging from 90% to 100%.
Taking into account the fact that we are predicting over more than 4000 grid cells and 53 weeks using aggregated data from 72 municipalities and 12 months, these results indicate a good predictive performance of the model.

Discussion
We have presented a novel model for the disaggregation of grouped data in both space and time, based on the spatio-temporal penalized composite link model approach. This framework was used to model Q fever counts (recorded in municipalities and months) to obtain Q fever incidence estimates over the fine grid and weeks. The model allows to obtain detailed trends in disease incidence, mortality risks, or any other vital rates at a desirable fine spatio-temporal resolution. Therefore, the resulting ST-PCLM outcomes can be displayed as a dynamic map. It also allows to include population information at a fine resolution in the estimation process.
The flexibility of the model is provided by the use of B-splines, along with a penalty on the regression coefficients, and the link between the areas and the fine-resolution grid is achieved through the composition matrix. Our proposal can also address situations in which population information is recorded over small spatial units that are nested in coarser ones (for example, from municipalities to census tracts). In that case, the centroids of the small units can be used to represent the fine spatial scale. Furthermore, the model allows the incorporation of covariates of interest (such as, for example, socio-economic, demographic, and environmental factors) in the ST-PCLM formulation to improve the estimation of the latent trend. They can be included at the aggregated level or at the fine-scale level (or at both levels simultaneously). The inclusion of covariates at the fine-scale level is immediate by adding them as columns in the design matrix X in (6) (if a linear relationship is assumed), or adding columns in Z if a non-linear relationship is expected. Details on how to include explanatory variables measured at the aggregated level can be found in [20].
It is important to acknowledge the use of GLAM methods in conjunction with the SAP algorithm, to avoid storage problems and to speed up computations. However, we are aware that the disaggregation of grouped data into a very detailed resolution could lead to increased computational load and storage problems. The sparsity of the marginal composition matrices can be exploited to deal with these issues (see, for example, [44]).
We have conducted two simulation studies to evaluate the prediction accuracy of the ST-PCLM. The first study aimed to check the performance of the model under different degrees of spatial dependence, and the second to asses the effect of different levels of temporal disaggregation. The overall performance of the model was good, although, as expected, it was affected by the number of units available at the aggregated level.
In the ST-PCLM formulation, we assume the spatial unit boundaries remain fixed over time. But it may happen that some of these boundaries change over the years. This issue is known in the statistical literature as the spatio-temporal misalignment problem. As future work, we plan to extend the ST-PCLM approach to handle this problem, where marginal composition matrices will play an important role. Furthermore, we can exploit the unique relationship between the penalties associated with basis and the covariance structure they yield to explore the use of other common spatio-temporal covariance matrices.
The implementation of our proposal and all data analyses were carried out using the statistical software R [45].